Abstract — We propose a learning algorithm for nonparametric
estimation and on-line prediction for general stationary ergodic
sources. The idea is to prepare many histograms and estimate
the probability distribution of the bins in each histogram. We
do not know a priori which histogram expresses the true
distribution: if the histogram is too fine, the estimate captures
too much of the noise (overfitting). To this end, we weight those
distributions to obtain an estimate of the true distribution. As
long as the weights are positive, we obtain a desired property:
the Kullback-Leibler information divided by the number n of
examples diminishes as n grows.

Index Terms — nonparametric estimation, on-line prediction,
stationary ergodic, Shannon-McMillan-Breiman, universal
coding.

I. Introduction 

In machine learning, the problems of estimating a probabilistic
rule given examples and predicting the next data from
the past sequence have been intensively considered for many
years. However, it seems that most of the effort has been
devoted to the case in which each example consists of
attributes taking values in a finite set. This paper deals with
nonparametric estimation and on-line prediction assuming that the data
sequence available for learning has been emitted by a general
stationary ergodic process.

In this paper, by nonparametric estimation, we mean capturing
the stochastic process that has emitted a given sequence
without assuming that each random variable takes values in a finite
set. If the random variables take values in a finite set, then we only
need relative frequencies or their modifications to estimate
the conditional probability $P(x_j \mid x_1, \dots, x_{j-1})$ of the next data
$x_j$ given the past sequence $x_1, \dots, x_{j-1}$, so that the whole
distribution

$$P(x_1, \dots, x_n) = \prod_{j=1}^{n} P(x_j \mid x_1, \dots, x_{j-1})$$

can be estimated exactly as the sequence length $n$ grows,
which is assured by the law of large numbers. However, we
are not sure what to do in general situations in which $\{X_j\}_{j=1}^{\infty}$
may not take values in a finite set. A similar situation arises in
on-line prediction, which needs to estimate $P(x_j \mid x_1, \dots, x_{j-1})$
as well.

To this end, we take an information theoretical approach.
Suppose we wish to construct an algorithm that compresses data
sequences to be as short as possible (the compressed data should be
decompressed to the original via another prescribed algorithm).
The best way would be to utilize the probability $P$ of the whole
sequences to encode each sequence based on the framework of
information theory [6], [3]. However, if such knowledge is not
available, we need to estimate $P$ from one given sequence. If
the process obeys the law of large numbers, say for stationary
ergodic sources, the relative frequencies converge to
definite values so that we can construct an estimate $Q$ of
the original unknown $P$. Then, we can encode the given
sequence $x_1, \dots, x_n$ into a codeword of length roughly

$$-\log Q^n(x_1, \dots, x_n)$$

based on $Q$. Thus, if the expected length

$$E[-\log Q^n(X_1, \dots, X_n)]$$

is to be short, the goal is to minimize the difference
$D(P^n \| Q^n)$, where $D(\cdot \| \cdot)$ is the Kullback-Leibler
information [5], which is useful for evaluating estimation errors.
Recent information theory covers the case that does not assume
a priori knowledge of the stochastic nature of the sequences
to be encoded (universal coding). We call a stochastic process
simply a source, as used in information theory [6].
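The link between code length and estimation error described above can be illustrated on a toy finite alphabet. The following is only a sketch: the distributions `p_true` and `q_est` are hypothetical, and logarithms are taken to base 2 so that lengths are in bits. It checks that the expected excess length of coding with $Q$ instead of $P$ equals $D(P \| Q)$.

```python
import math

def entropy_bits(p):
    """Shannon entropy H(P) in bits of a finite distribution given as a dict."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

def kl_bits(p, q):
    """Kullback-Leibler information D(P||Q) in bits.

    Assumes q[x] > 0 wherever p[x] > 0.
    """
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

def expected_length_bits(p, q):
    """Expected ideal code length E_P[-log2 Q(X)] when coding with Q under P."""
    return -sum(px * math.log2(q[x]) for x, px in p.items() if px > 0)

# Hypothetical example: the true P is uniform, the estimate Q is skewed.
p_true = {"a": 0.5, "b": 0.5}
q_est = {"a": 0.25, "b": 0.75}

# Excess expected length over the entropy; it coincides with D(P||Q).
excess = expected_length_bits(p_true, q_est) - entropy_bits(p_true)
```

The identity $E_P[-\log Q] = H(P) + D(P \| Q)$ is what makes the Kullback-Leibler information a natural measure of how much is lost by coding with an estimate.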

One might wonder why only the minimization of
$D(P^n \| Q^n)$ is considered as a criterion of estimation and
prediction while many other ways to evaluate the performance
are available; for example,

$$e_n := |P(x_n \mid x_1, \dots, x_{n-1}) - Q(x_n \mid x_1, \dots, x_{n-1})|$$

may be a better alternative to $D(\cdot \| \cdot)$. However, it is known
[7][8] that for any estimation $Q^n$, there exists a stationary
ergodic source $P^\infty$ such that $\limsup_{n \to \infty} e_n > 0$ with
probability one.

Now let us focus on continuous data. Recently, thanks
to Boris Ryabko [9], great progress has been made on the
problems of nonparametric estimation and on-line prediction
for continuous data. The idea is to estimate the probability
density function for the source $\{X_j\}_{j=1}^{\infty}$. Let $\{A_i\}_{i=0}^{\infty}$ be an
increasing sequence of finite partitions of $\mathbb{R}$. Then, we can
estimate the probability $P_i^n$ of the concatenations of elements of $A_i$ of
length $n$ to construct the estimate $Q_i^n$ for each $i$. If we divide
$P_i^n$ and its estimate $Q_i^n$ by the $n$-dimensional volume,
we obtain the probability density function $f_i^n$ and
its estimate $g_i^n$ for each $i$, respectively. Then, if we mix
the estimates $g_i^n$ with weights $\{\omega_i\}$ such that $\sum_i \omega_i = 1$
and $\omega_i > 0$, then $g^n := \sum_i \omega_i g_i^n$ can be an estimate of
the probability density function $f^n$. Ryabko proved that such
an estimate $g^n$ has desirable properties for nonparametric
estimation and on-line prediction, such as $D(f^n \| g^n)/n \to 0$
$(n \to \infty)$.
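The mixing step can be sketched for the simplest setting of i.i.d. data on $[0, 1)$ with dyadic partitions. This is an illustration only, not the construction analyzed in the paper: the function names are hypothetical, the bin probabilities are estimated by add-1/2 smoothed relative frequencies rather than by a universal code, and the infinite weight sequence is truncated at `max_level` and renormalized.

```python
from collections import Counter

def histogram_density(samples, level, x):
    """Level-i histogram density estimate at x on [0, 1) with 2**(level+1) equal bins."""
    bins = 2 ** (level + 1)
    width = 1.0 / bins
    # Bin index of each sample; min() guards against rounding at the right edge.
    counts = Counter(min(int(s / width), bins - 1) for s in samples)
    # Add-1/2 smoothing keeps every bin probability positive (assumption of this sketch).
    prob = (counts[min(int(x / width), bins - 1)] + 0.5) / (len(samples) + 0.5 * bins)
    # Dividing the bin probability by the bin width gives a density.
    return prob / width

def mixture_density(samples, x, max_level=8):
    """Finite truncation of g(x) = sum_i w_i g_i(x) with positive weights w_i ~ 2**-(i+1)."""
    weights = [2.0 ** -(i + 1) for i in range(max_level + 1)]
    total = sum(weights)
    return sum(w / total * histogram_density(samples, i, x)
               for i, w in enumerate(weights))
```

Because every level receives a positive weight, the mixture is never much worse than the best single histogram, which is the property the weighting is meant to secure.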

However, several problems remain even in Ryabko's inspiring
framework:
1) The probability density function $f$ should exist.

2) The differential entropy of $f_i$ should converge to that of
$f$ as $i \to \infty$.

Example 1: Suppose that each random variable $X_j$ independently
obeys the distribution function

[ x < -1 

FxM = \ I -l<x<0 
{ /; \g{t)dt < x 

with $\int_0^{\infty} g(x)\,dx = 1$. Then, $X_j$ neither takes values in a finite
set nor has a density function $f_X$ such that



i(Si(x)) 



and 



for $x \in \mathbb{R}$, although $\{X_j\}_{j=1}^{\infty}$ is stationary ergodic.

The proposed method includes the finite and continuous
cases as special cases.

For the proposed learning algorithm for nonparametric
estimation and on-line prediction, we do not have to know a
priori whether the source is discrete or continuous, or whether the
source has a probability density function when it is continuous.
We only need to know that the source is stationary ergodic, and
the same algorithm can be applied to the given data whether the source
is discrete or continuous. The derivation is very simple:
unlike the original paper by Ryabko [9], this paper applies A.
Barron's generalized Shannon-McMillan-Breiman theorem
to obtain the general result.

Section 2 illustrates the main result using a typical example.
Sections 3 and 4 give existing results for finite and continuous
sources, respectively. In particular, Section 4 summarizes
Ryabko's original results on nonparametric estimation and on-line
prediction without proof. The results in these two sections
are extended to the general case in Section 5. Section 6 shows
how the main result works in several cases, including ones that
the existing methods cannot deal with. In Section 7, the main
result is applied to on-line prediction. Section 8 concludes this
paper with future topics.

II. The idea: Histogram Weighting

We first illustrate the main result using a typical example.
Suppose $\{X_k\}_{k=1}^{\infty}$ is independent and identically distributed
and takes values in $X_k(\Omega) \subseteq [0, 1)$. We wish to estimate
the density function $f$ from examples $X_k = x_k \in [0, 1)$,
$k = 1, \dots, n$. To this end, we prepare several histograms as
follows.

Level 0: $A_0 = \{[0, 1/2), [1/2, 1)\}$ consisting of two bins
Level 1: $A_1 = \{[0, 1/4), [1/4, 1/2), [1/2, 3/4), [3/4, 1)\}$
consisting of four bins
...
Level i: $A_i$ consisting of $2^{i+1}$ bins
...

For each level $i = 0, 1, \dots$, we can estimate the discrete
distribution by counting the frequencies of the $2^{i+1}$ bins to obtain
the estimate $Q_{n,i}(a)$ of the bin probabilities $P_i(a)$ for $a \in A_i$,
$i = 0, 1, 2, \dots$. We do not know a priori which histogram $Q_{n,i}$
is the closest among $i = 0, 1, \dots$ to the true distribution $P$.
Then, we approximate the density function $f(x)$ by

$$g_n(x) = \sum_{i=0}^{\infty} \omega_i\, g_{n,i}(x)$$

with histograms

$$g_{n,i}(x) := \frac{Q_{n,i}(s_i(x))}{\lambda(s_i(x))}$$

and weights $\{\omega_i\}_{i=0}^{\infty}$ such that $\omega_i > 0$ and $\sum_{i=0}^{\infty} \omega_i = 1$, where
$s_i : [0, 1) \to A_i$ is the projection and $\lambda(a)$ is the width of the bin $a$.

Ryabko proved that the Kullback-Leibler information between
$f$ and $g_n$ converges to zero as $n \to \infty$. However, what
if the density function $f$ does not exist, as in Example 1? That
is the problem we address in this paper.

III. Finite Sources

Let $\{X_n\}_{n=1}^{\infty}$ be a stationary ergodic source expressed by
probability $P^\infty$ generating each $X_n$ in a finite set $A$. The
entropy of the source is defined by

$$H(P^\infty) := \lim_{n \to \infty} -\frac{1}{n} \sum_{x^n \in A^n} P^n(x^n) \log P^n(x^n)$$

(such a limit always exists for stationary sources).

By coding, we mean any mapping $\varphi^n : A^n \to \{0, 1\}^*$
satisfying

$$\varphi^n(x^n) = \varphi^n(y^n) \Longrightarrow x^n = y^n$$

for $x^n, y^n \in A^n$. It is known that, if $\varphi^n$ is moreover uniquely
decodable (for example, prefix-free),

$$\sum_{x^n \in A^n} 2^{-|\varphi^n(x^n)|} \le 1$$

(Kraft's inequality), and that if $l_n : A^n \to \mathbb{N}$
satisfies

$$\sum_{x^n \in A^n} 2^{-l_n(x^n)} \le 1,$$

there exists a coding $\varphi^n$ such that $|\varphi^n(x^n)| = l_n(x^n)$ for
$x^n \in A^n$, where we denote $|z| = m$ when $z \in \{0, 1\}^m$. We
can construct $\varphi^n$ from $P^n$ such that [6]

$$\lim_{n \to \infty} \frac{1}{n} E|\varphi^n(X_1, \dots, X_n)| = H(P^\infty).$$

Even without knowledge of $P^n$, we can construct a coding
$\varphi^n$ satisfying

$$\lim_{n \to \infty} \frac{1}{n} |\varphi^n(x_1, \dots, x_n)| = H(P^\infty) \tag{1}$$

with probability one and

$$\lim_{n \to \infty} \frac{1}{n} E|\varphi^n(X_1, \dots, X_n)| = H(P^\infty) \tag{2}$$

for all stationary ergodic $P^\infty$ (universal coding).
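The correspondence between code lengths and sub-probability measures used above can be checked on a small example. The sketch below assumes Shannon code lengths $l(x) = \lceil -\log_2 P(x) \rceil$ for a hypothetical three-symbol distribution; the function names are illustrative and not taken from the paper.

```python
import math

def shannon_lengths(p):
    """Code lengths l(x) = ceil(-log2 P(x)) for a finite distribution given as a dict."""
    return {x: math.ceil(-math.log2(px)) for x, px in p.items() if px > 0}

def kraft_sum(lengths):
    """Left-hand side of Kraft's inequality, sum_x 2^{-l(x)}."""
    return sum(2.0 ** -l for l in lengths.values())

def induced_measure(lengths):
    """Q(x) = 2^{-l(x)}: a sub-probability measure whenever Kraft's inequality holds."""
    return {x: 2.0 ** -l for x, l in lengths.items()}

# Hypothetical source distribution on three symbols.
p = {"a": 0.5, "b": 0.3, "c": 0.2}
lengths = shannon_lengths(p)   # lengths 1, 2 and 3 bits
q = induced_measure(lengths)   # total mass 0.875, at most 1
```

Since the Kraft sum does not exceed one, lengths of this form can be realized by a prefix code, and the induced $Q$ never exceeds $P$ pointwise because rounding up only lengthens codewords.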
Lemma 1 (Shannon-McMillan-Breiman [6]):

$$\lim_{n \to \infty} -\frac{1}{n} \log P^n(x_1, \dots, x_n) = H(P^\infty)$$

with probability one for all stationary ergodic $P^\infty$.
Let

$$Q^n(x_1, \dots, x_n) := 2^{-|\varphi^n(x_1, \dots, x_n)|}$$

for $x_1, \dots, x_n \in A$. From (1), (2), and Lemma 1, we have the
following proposition.

Proposition 1 (Ryabko [8]): 1)

$$\frac{1}{n} \log \frac{P^n(x_1, \dots, x_n)}{Q^n(x_1, \dots, x_n)} \to 0 \tag{3}$$

with probability 1, and

2)

$$\frac{1}{n} \sum_{x_1, \dots, x_n} P^n(x_1, \dots, x_n) \log \frac{P^n(x_1, \dots, x_n)}{Q^n(x_1, \dots, x_n)} \to 0 \tag{4}$$

for all stationary ergodic sources $P^\infty$ as $n \to \infty$.
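The behaviour stated in (3) can be observed numerically in a much simpler setting than the one covered by the proposition. The sketch below swaps in the Krichevsky-Trofimov (KT) estimator as $Q^n$, which is universal only for i.i.d. binary sources, and a hypothetical Bernoulli source with parameter `p_one`; it is meant as an illustration of a vanishing per-symbol redundancy, not as the universal code assumed in the text.

```python
import math

def kt_log2_prob(bits):
    """log2 of the Krichevsky-Trofimov probability assigned to a binary sequence."""
    counts = [0, 0]
    log_q = 0.0
    for j, b in enumerate(bits):
        # KT predictive probability of the next symbol: (count + 1/2) / (j + 1).
        log_q += math.log2((counts[b] + 0.5) / (j + 1.0))
        counts[b] += 1
    return log_q

def iid_log2_prob(bits, p_one):
    """log2 P^n(x_1..x_n) under an i.i.d. Bernoulli(p_one) source."""
    ones = sum(bits)
    return ones * math.log2(p_one) + (len(bits) - ones) * math.log2(1.0 - p_one)

def per_symbol_redundancy(bits, p_one):
    """(1/n) log2 [P^n(x) / Q^n(x)] with Q^n given by the KT estimator."""
    return (iid_log2_prob(bits, p_one) - kt_log2_prob(bits)) / len(bits)

# Hypothetical deterministic test sequence with exactly 30% ones.
bits = [1 if k % 10 < 3 else 0 for k in range(1000)]
redundancy = per_symbol_redundancy(bits, 0.3)
```

For the KT estimator the total redundancy on a binary sequence of length $n$ is at most $\frac{1}{2}\log_2 n + 1$ bits, so the per-symbol value decays like $(\log n)/n$.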

IV. Continuous Sources

This section summarizes Boris Ryabko's pioneering work
on continuous sources with a probability density function.

Let $\{X_n\}_{n=1}^{\infty}$ be a stationary ergodic source expressed by
probability density function $f^\infty$ generating each $X_n$. The
differential entropy of the source $f^\infty$ is defined by

$$h(f^\infty) := \lim_{n \to \infty} -\frac{1}{n} \int f^n(x^n) \log f^n(x^n)\, dx^n$$

(such a limit always exists for stationary sources).

Let $\{A_i\}_{i=0}^{\infty}$ be an increasing sequence of finite partitions
of $\mathbb{R}$ that asymptotically generates the Borel $\sigma$-field $\mathcal{B}$, so
that the elements of $A_i$ are non-empty disjoint subsets of $\mathbb{R}$.
Let $s_i : \mathbb{R} \to A_i$ be the projection $x \mapsto a \ni x$, and let
$P_i^n(a_1, \dots, a_n)$ denote the probability that $s_i(X_j) = a_j$ for
$j = 1, \dots, n$. For each $i = 0, 1, \dots$, since $P_i^\infty$ is stationary
ergodic over the finite alphabet $A_i$, we can construct a universal
coding $\varphi_i^n$ for $A_i^n$, and define

$$Q_i^n(a_1, \dots, a_n) := 2^{-|\varphi_i^n(a_1, \dots, a_n)|}$$

for $a_1, \dots, a_n \in A_i$. For each $i = 0, 1, \dots$, we define

$$f_i^n(x_1, \dots, x_n) := \frac{P_i^n(s_i(x_1), \dots, s_i(x_n))}{\lambda^n(s_i(x_1), \dots, s_i(x_n))}$$

and

$$g_i^n(x_1, \dots, x_n) := \frac{Q_i^n(s_i(x_1), \dots, s_i(x_n))}{\lambda^n(s_i(x_1), \dots, s_i(x_n))}$$

for $(x_1, \dots, x_n) \in \mathbb{R}^n$, where $\lambda^n(a_1, \dots, a_n)$ is the Lebesgue
measure of $(a_1, \dots, a_n) \in A_i^n$. Let $\{\omega_i\}_{i=0}^{\infty}$ be such that
$\sum_{i=0}^{\infty} \omega_i = 1$ and $\omega_i > 0$. Then, we can define the density
estimate $g^n$ as follows: for $(x_1, \dots, x_n) \in \mathbb{R}^n$,

$$g^n(x_1, \dots, x_n) := \sum_{i=0}^{\infty} \omega_i\, g_i^n(x_1, \dots, x_n). \tag{5}$$

Lemma 2 (Shannon-McMillan-Breiman [1]):

$$\lim_{n \to \infty} -\frac{1}{n} \log f^n(x_1, \dots, x_n) = h(f^\infty) \tag{6}$$

with probability one for all stationary ergodic $f^\infty$.
We also consider the differential entropy $h(f_i^\infty)$ of the
stationary ergodic source $f_i^\infty$ for each $i = 0, 1, \dots$. If we apply
Lemma 2 to the probability density function $f_i^\infty$, we have

$$\lim_{n \to \infty} -\frac{1}{n} \log f_i^n(x_1, \dots, x_n) = h(f_i^\infty) \tag{7}$$

with probability one.

Proposition 2 (Ryabko [9]): Suppose

$$\lim_{i \to \infty} h(f_i^\infty) = h(f^\infty). \tag{8}$$

Then,

$$\frac{1}{n} \log \frac{f^n(x_1, \dots, x_n)}{g^n(x_1, \dots, x_n)} \to 0 \tag{9}$$

with probability 1, and

$$\frac{1}{n} \int f^n(x_1, \dots, x_n) \log \frac{f^n(x_1, \dots, x_n)}{g^n(x_1, \dots, x_n)}\, dx_1 \cdots dx_n \to 0 \tag{10}$$

as $n \to \infty$.
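One way to see why positive weights suffice for the upper half of (9) is the following bound. It is a sketch only, written under the assumption that all densities involved are positive at the observed point, and it mirrors the argument given in the proof of Theorem 1 for general sources.

```latex
g^n(x_1,\dots,x_n) \ge \omega_i\, g_i^n(x_1,\dots,x_n)
\quad\Longrightarrow\quad
\frac{1}{n}\log\frac{f^n}{g^n}
\;\le\;
-\frac{1}{n}\log\omega_i
+\frac{1}{n}\log\frac{f^n}{f_i^n}
+\frac{1}{n}\log\frac{f_i^n}{g_i^n}
```

For fixed $i$, the first term vanishes as $n \to \infty$ because $\omega_i > 0$; the second tends to $h(f_i^\infty) - h(f^\infty)$ by (6) and (7); and the third equals $\frac{1}{n}\log (P_i^n/Q_i^n)$, which tends to zero by Proposition 1 applied to the quantized source. Letting $i \to \infty$ and using (8) then makes the right-hand side arbitrarily small.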



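V. Nonparametric Estimation for General Sources

Let $\{X_n\}_{n=1}^{\infty}$ be a stationary ergodic source expressed by
measure $\mu^\infty$ generating each $X_n$.

Let $\{A_i\}_{i=0}^{\infty}$ be a sequence of finite partitions of $\mathbb{R}$ such
that $A_{i+1}$ is a refinement of $A_i$, so that the elements $a \in A_i$ are
non-empty disjoint subsets of $\mathbb{R}$. For each $i = 0, 1, \dots$, since
$P_i^\infty$ is stationary ergodic over the finite alphabet $A_i$, we can
construct a universal coding $\varphi_i^n$ for $A_i^n$, and define

$$Q_i^n(a_1, \dots, a_n) := 2^{-|\varphi_i^n(a_1, \dots, a_n)|}$$

for $a_1, \dots, a_n \in A_i$. For a measure $\eta^n : \mathcal{B}^n \to \mathbb{R}$ such that
$\mu^n \ll \eta^n$ and each $i = 0, 1, \dots$, we define

$$\mu_i^n(D_1, \dots, D_n) := \sum_{a_1, \dots, a_n \in A_i} \frac{\eta^n(a_1 \cap D_1, \dots, a_n \cap D_n)}{\eta^n(a_1, \dots, a_n)}\, P_i^n(a_1, \dots, a_n)$$

and

$$\nu_i^n(D_1, \dots, D_n) := \sum_{a_1, \dots, a_n \in A_i} \frac{\eta^n(a_1 \cap D_1, \dots, a_n \cap D_n)}{\eta^n(a_1, \dots, a_n)}\, Q_i^n(a_1, \dots, a_n)$$

for $(D_1, \dots, D_n) \in \mathcal{B}^n$.

Let $\{\omega_i\}_{i=0}^{\infty}$ be as in the previous section. We define the
estimate $\nu^n$ as follows: for $(D_1, \dots, D_n) \in \mathcal{B}^n$,

$$\nu^n(D_1, \dots, D_n) := \sum_{i=0}^{\infty} \omega_i\, \nu_i^n(D_1, \dots, D_n). \tag{11}$$

Remark 1: When $\mu^n \ll \eta^n = \lambda^n$, if we differentiate (11)
with respect to $\lambda^n$, we obtain

$$\frac{d\nu^n}{d\lambda^n}(x_1, \dots, x_n) = \sum_{i=0}^{\infty} \omega_i \frac{d\nu_i^n}{d\lambda^n}(x_1, \dots, x_n).$$

If we put

$$g^n(x_1, \dots, x_n) := \frac{d\nu^n}{d\lambda^n}(x_1, \dots, x_n)$$

and

$$g_i^n(x_1, \dots, x_n) := \frac{d\nu_i^n}{d\lambda^n}(x_1, \dots, x_n),$$

then we see that (11) becomes (5) when $\mu^n \ll \lambda^n$.

Notice that for any sequence of universal codes $\{\varphi_i^n\}_{i=0}^{\infty}$, we
have $\nu^n(D^n) > 0$ for all $D^n \in \{\mathcal{B} - \{\emptyset\}\}^n$, so that $\mu^n \ll \nu^n$.

Proposition 3 (Barron): Let $(\Omega, \mathcal{F}, \mu)$ and $\nu$ be a
probability space and a $\sigma$-finite measure. Then,

$$\frac{1}{n} \log \frac{d\mu^n}{d\nu^n}(x_1, \dots, x_n) \to D(\mu \| \nu) := \lim_{k \to \infty} \frac{1}{k} D(\mu^k \| \nu^k) \tag{12}$$

as $n \to \infty$ with probability one, where

$$D(\mu^n \| \nu^n) := \int d\mu^n \log \frac{d\mu^n}{d\nu^n} \ge 0$$

if $\nu^n(\Omega) \le 1$.

From Proposition 3, since $\eta^n(\Omega) \le 1$, we have

$$\frac{1}{n} \log \frac{d\mu^n}{d\eta^n}(x_1, \dots, x_n) \to D(\mu \| \eta) \ge 0 \tag{13}$$

and

$$\frac{1}{n} \log \frac{d\mu_i^n}{d\eta^n}(x_1, \dots, x_n) \to D(\mu_i \| \eta) \ge 0$$

as $n \to \infty$ with probability one.
Theorem 1: Suppose

$$\lim_{i \to \infty} D(\mu_i \| \eta) = D(\mu \| \eta). \tag{14}$$

Then,

$$\frac{1}{n} \log \frac{d\mu^n}{d\nu^n}(x_1, \dots, x_n) \to 0 \tag{15}$$

with probability one as $n \to \infty$, and

$$D(\mu \| \nu) = 0. \tag{16}$$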
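Proof. For any integer $i \ge 0$, we have

$$\nu^n(D_1, \dots, D_n) \ge \omega_i\, \nu_i^n(D_1, \dots, D_n)$$

for any $(D_1, \dots, D_n) \in \mathcal{B}^n$, which is equivalent to

$$\frac{d\mu^n}{d\nu^n}(x_1, \dots, x_n) \le \frac{1}{\omega_i} \frac{d\mu^n}{d\nu_i^n}(x_1, \dots, x_n)$$

with probability 1. Hence,

$$\frac{1}{n} \log \frac{d\mu^n}{d\nu^n}(x_1, \dots, x_n) \le -\frac{1}{n} \log \omega_i + \frac{1}{n} \log \frac{d\mu^n}{d\nu_i^n}(x_1, \dots, x_n) \tag{17}$$

with probability 1 for each $i = 0, 1, \dots$. The first term
converges to zero. Since $\mu^\infty$ is stationary ergodic, so is $\mu_i^\infty$
for each $i = 0, 1, \dots$. Thus, from Proposition 1,

$$\frac{1}{n} \log \frac{P_i^n(a_1, \dots, a_n)}{Q_i^n(a_1, \dots, a_n)} \to 0$$

with probability 1 as $n \to \infty$, so that we may write

$$P_i^n(a_1, \dots, a_n) = Q_i^n(a_1, \dots, a_n)\, f_i(a_1, \dots, a_n)$$

with $f_i(a_1, \dots, a_n)^{1/n} \to 1$. If we define

$$K_{i,n} := \max_{a_1, \dots, a_n \in A_i} f_i(a_1, \dots, a_n),$$

then

$$\frac{1}{n} \log \frac{\mu_i^n(D_1, \dots, D_n)}{\nu_i^n(D_1, \dots, D_n)} = \frac{1}{n} \log \frac{\sum_{a_1, \dots, a_n \in A_i} \frac{\eta^n(a_1 \cap D_1, \dots, a_n \cap D_n)}{\eta^n(a_1, \dots, a_n)} P_i^n(a_1, \dots, a_n)}{\sum_{a_1, \dots, a_n \in A_i} \frac{\eta^n(a_1 \cap D_1, \dots, a_n \cap D_n)}{\eta^n(a_1, \dots, a_n)} Q_i^n(a_1, \dots, a_n)} \le \frac{1}{n} \log K_{i,n} \to 0,$$

which means that the contribution of $\frac{1}{n} \log \frac{d\mu_i^n}{d\nu_i^n}$ to the
second term in (17) also converges to zero as $n \to \infty$.

From (13), (14), and (17), together with the decomposition
$\log \frac{d\mu^n}{d\nu_i^n} = \log \frac{d\mu^n}{d\eta^n} - \log \frac{d\mu_i^n}{d\eta^n} + \log \frac{d\mu_i^n}{d\nu_i^n}$, we have the
first statement (15). The second equation (16) is immediate from
Proposition 3.

VI. Examples

The condition (14) in Theorem 1 exactly specifies what
$\{A_i\}_{i=0}^{\infty}$ should satisfy.

Example 2 (Finite Sources): Since

$$\mu^n(a_1, \dots, a_n) = P^n(a_1, \dots, a_n)$$

and

$$\nu^n(a_1, \dots, a_n) = Q^n(a_1, \dots, a_n)$$

for $a_1, \dots, a_n \in A$, (15) and (16) extend (3) and (4), respectively, and
condition (14) is always met.

Example 3 (Continuous Sources): The equations
(14), (15), (16) extend (8), (9), (10) in Proposition 2. In fact,
when $\mu^n \ll \lambda^n$ and $\mu_i^n \ll \lambda^n$, we have

$$\frac{d\mu^n}{d\lambda^n}(x_1, \dots, x_n) = f^n(x_1, \dots, x_n)$$

and

$$\frac{d\mu_i^n}{d\lambda^n}(x_1, \dots, x_n) = f_i^n(x_1, \dots, x_n).$$

For example, (14) implies

$$\lim_{i \to \infty} \lim_{n \to \infty} \frac{1}{n} \log \frac{f^n(x_1, \dots, x_n)}{f_i^n(x_1, \dots, x_n)} = 0,$$

which further from (6) and (7) implies (8). In particular, if the
source is i.i.d. and the density function

$$f^n(x_1, \dots, x_n) = \prod_{j=1}^{n} f(x_j)$$

is continuous, and positive only when $a \le x_j \le b$, $j = 1, \dots, n$,
then for $A_i := \{c_0, \dots, c_{2^{i+1}-1}\}$ with
$c_k = [a + k \cdot 2^{-i-1}(b - a),\ a + (k+1) \cdot 2^{-i-1}(b - a))$, $k = 0, 1, \dots, 2^{i+1} - 1$,
there exists $\{M_i\}$ such that

$$\sup_k \sup_{x, y \in c_k} |f(x) - f(y)| \le M_i \to 0$$

as $i \to \infty$. Then, for each $x \in \mathbb{R}$ such that $f(x) > 0$,

$$\log \frac{f(x)}{f_i(x)} \le \log \frac{f_i(x) + M_i}{f_i(x)}$$

and

$$\log \frac{f_i(x)}{f(x)} \le \log \frac{f(x) + M_i}{f(x)} \to 0$$

as $i \to \infty$, where the mean value theorem (if $a \le r < s \le b$,
there exists $r \le x_0 \le s$ such that

$$\frac{1}{s - r} \int_r^s f(x)\, dx = f(x_0)$$

) has been applied. Hence, we have

$$\left| \frac{1}{n} \log \frac{f^n(x_1, \dots, x_n)}{f_i^n(x_1, \dots, x_n)} \right| \le \frac{1}{n} \sum_{j=1}^{n} \left| \log \frac{f_i(x_j)}{f(x_j)} \right| \le \frac{1}{n} \sum_{j=1}^{n} \max\left\{ \log \frac{f_i(x_j) + M_i}{f_i(x_j)},\ \log \frac{f(x_j) + M_i}{f(x_j)} \right\},$$

which almost surely converges to a constant $L_i$ as $n \to \infty$
such that $\lim_{i \to \infty} L_i = 0$. Thus, condition (14) is obtained.

Example 4 (Countable Sources): Suppose the source is
i.i.d. and that $X(\Omega) = \mathbb{N} = \{1, 2, \dots\}$. If we choose
$\eta(X = k) := \frac{1}{k} - \frac{1}{k+1}$ and $A_i := \{\{1\}, \dots, \{i\}, \mathbb{N} - \{1, \dots, i\}\}$,
then

$$\mu_i(X = k) = \begin{cases} \mu(X = k), & k \le i, \\ \dfrac{\eta(X = k)}{\sum_{l > i} \eta(X = l)} \sum_{l > i} \mu(X = l), & k > i, \end{cases}$$

so that $\mu_i(X = k) \to \mu(X = k)$ as $i \to \infty$ for all
$k \in \mathbb{N}$. Thus, $\frac{1}{n} \log \frac{d\mu^n}{d\mu_i^n}(x_1, \dots, x_n)$ almost surely converges to
a constant $L_i$ such that $L_i \to 0$ as $i \to \infty$.

VII. On-Line Prediction

From Lemma 4, and since $\mu^j \ll \mu^{j-1}$ and $\nu^j \ll \nu^{j-1}$,
there exist conditional measures $\mu^{j|j-1}(D_j \mid x^{j-1})$ and
$\nu^{j|j-1}(D_j \mid x^{j-1})$ for $D_j \in \mathcal{B}$ and $x^{j-1} \in \mathbb{R}^{j-1}$. The
conditional measure is used for on-line prediction.

Theorem 2: Let $r : \mathbb{R} \to \mathbb{R}$ be a bounded measurable
function. Then, under (14),

$$\lim_{n \to \infty} \frac{1}{n} E \sum_{j=1}^{n} \left\{ \int r(x)\, d\mu^{j|j-1}(x \mid x_1, \dots, x_{j-1}) - \int r(x)\, d\nu^{j|j-1}(x \mid x_1, \dots, x_{j-1}) \right\}^2 = 0$$

and

$$\lim_{n \to \infty} \frac{1}{n} E \sum_{j=1}^{n} \left| \int r(x)\, d\mu^{j|j-1}(x \mid x_1, \dots, x_{j-1}) - \int r(x)\, d\nu^{j|j-1}(x \mid x_1, \dots, x_{j-1}) \right| = 0.$$

Proof: Let $b$ be such that $|r(x)| \le b$, $x \in \mathbb{R}$. Then, for each
$j = 1, \dots, n$, we have

$$\left\{ \int r(x)\, d\mu^{j|j-1}(x \mid x_1, \dots, x_{j-1}) - \int r(x)\, d\nu^{j|j-1}(x \mid x_1, \dots, x_{j-1}) \right\}^2$$
$$\le b^2 \left\{ \int \left| d\mu^{j|j-1}(x \mid x_1, \dots, x_{j-1}) - d\nu^{j|j-1}(x \mid x_1, \dots, x_{j-1}) \right| \right\}^2$$
$$= 4 b^2 \left\{ \sup_{A} \left| \int_A d\mu^{j|j-1}(x \mid x_1, \dots, x_{j-1}) - \int_A d\nu^{j|j-1}(x \mid x_1, \dots, x_{j-1}) \right| \right\}^2$$
$$\le \frac{2 b^2}{\log e} \int d\mu^{j|j-1}(x \mid x_1, \dots, x_{j-1}) \log \frac{d\mu^{j|j-1}}{d\nu^{j|j-1}}(x \mid x_1, \dots, x_{j-1}),$$

where Lemma 3 below has been applied for the last inequality.
From Theorem 1, we obtain the final result:

$$\frac{1}{n} E \sum_{j=1}^{n} \left\{ \int r(x)\, d\mu^{j|j-1}(x \mid x_1, \dots, x_{j-1}) - \int r(x)\, d\nu^{j|j-1}(x \mid x_1, \dots, x_{j-1}) \right\}^2$$
$$\le \frac{2 b^2}{\log e} \cdot \frac{1}{n} \int d\mu^n(x_1, \dots, x_n) \log \frac{d\mu^n}{d\nu^n}(x_1, \dots, x_n) \to 0.$$

Lemma 3 (Pinsker's inequality): For probability measures $\mu$ and $\nu$
on $(\Omega, \mathcal{F})$ with $\mu \ll \nu$,

$$\left\{ \sup_{A \in \mathcal{F}} |\mu(A) - \nu(A)| \right\}^2 \le \frac{1}{2 \log e} D(\mu \| \nu).$$

Example 5 (Finite Sources): Let $\{X_n\}_{n=1}^{\infty}$ be a stationary
ergodic source $P^\infty$ with each $X_n$ in a finite set $A$. Let
$r : A \to \mathbb{R}$ be a bounded integrable function. Then,

$$\lim_{n \to \infty} \frac{1}{n} E \sum_{j=1}^{n} \left\{ \sum_{x \in A} r(x) P^{j|j-1}(x \mid x_1, \dots, x_{j-1}) - \sum_{x \in A} r(x) Q^{j|j-1}(x \mid x_1, \dots, x_{j-1}) \right\}^2 = 0$$

and

$$\lim_{n \to \infty} \frac{1}{n} E \sum_{j=1}^{n} \left| \sum_{x \in A} r(x) P^{j|j-1}(x \mid x_1, \dots, x_{j-1}) - \sum_{x \in A} r(x) Q^{j|j-1}(x \mid x_1, \dots, x_{j-1}) \right| = 0,$$

where

$$P^{j|j-1}(x_j \mid x_1, \dots, x_{j-1}) := \frac{P^j(x_1, \dots, x_{j-1}, x_j)}{P^{j-1}(x_1, \dots, x_{j-1})}$$

and

$$Q^{j|j-1}(x_j \mid x_1, \dots, x_{j-1}) := \frac{Q^j(x_1, \dots, x_{j-1}, x_j)}{Q^{j-1}(x_1, \dots, x_{j-1})}.$$

VIII. Concluding Remarks

We proposed a learning algorithm for nonparametric estimation
and on-line prediction for general stationary ergodic
sources. However, we need to know a priori what $\{A_i\}$
satisfies (14). In this sense, rough knowledge about the true
$\mu$ is required. Also, although the sequence $\{\omega_i\}$ must be infinite
to achieve the desired properties such as (15) and (16),
we can only use a finite $\{\omega_i\}$ in reality.

Future topics include a way to choose the sequences $\{A_i\}$
and $\{\omega_i\}$ for getting a better solution for finite $n$ by utilizing
a priori knowledge of the probability measure.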
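As a closing illustration of the on-line prediction result in Section 7, the chain of inequalities that links the squared prediction gap to the Kullback-Leibler information can be checked numerically for a single prediction step over a finite alphabet. The sketch below uses natural logarithms, so the bound reads $2 b^2 D$; the distributions and the function `r` are hypothetical values chosen only for the check.

```python
import math

def kl_nats(p, q):
    """Kullback-Leibler information D(P||Q) in nats for finite distributions (lists)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def l1_distance(p, q):
    """Total variation in its L1 form, sum_x |P(x) - Q(x)| = 2 sup_A |P(A) - Q(A)|."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def prediction_gap_sq(r, p, q):
    """Squared difference between the predicted means of r under P and under Q."""
    return sum(ri * (pi - qi) for ri, pi, qi in zip(r, p, q)) ** 2

# Hypothetical true and estimated conditional distributions on three symbols.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
r = [1.0, -2.0, 0.5]
b = max(abs(ri) for ri in r)  # bound on |r(x)|

gap = prediction_gap_sq(r, p, q)             # left-hand side
tv_bound = b * b * l1_distance(p, q) ** 2    # b^2 * (L1 distance)^2
kl_bound = 2 * b * b * kl_nats(p, q)         # Pinsker: (L1)^2 <= 2 D in nats
```

Summing such one-step bounds over $j$ and using the chain rule for the Kullback-Leibler information is what reduces the prediction error to $\frac{1}{n} D(\mu^n \| \nu^n)$.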
Appendix: Measure theory



Let $\mathcal{F}$ be a $\sigma$-field of the entire set $\Omega$, and $\mu$, $\nu$ $\sigma$-finite measures
on $\mathcal{F}$, which means that there exist $\{A_k\}$ and $\{B_k\}$ such
that $\Omega = \cup_k A_k = \cup_k B_k$ and $\mu(A_k) < \infty$, $\nu(B_k) < \infty$,
$k = 1, 2, \dots$. We say $g : \Omega \to \mathbb{R}$ is $\mathcal{F}$-measurable if
$\{\omega \in \Omega \mid g(\omega) \in D\} \in \mathcal{F}$ for each $D \in \mathcal{B}$, where $\mathcal{B}$ is the Borel
$\sigma$-field. We write $\mu \ll \nu$ when $\nu(A) = 0 \Longrightarrow \mu(A) = 0$ for
$A \in \mathcal{F}$, and define the integral for $\mathcal{F}$-measurable $g : \Omega \to \mathbb{R}$
over $A \in \mathcal{F}$ with respect to $\nu$ by



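$$\int_A g\, d\nu := \sup_{\{A_k\}} \sum_k \left[ \inf_{\omega \in A_k} g(\omega) \right] \nu(A_k),$$

where $\{A_k\}$ ranges over all partitions of $A$.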

Lemma 4 (Radon-Nikodym [2]): Suppose $\mu \ll \nu$ and that
they are both $\sigma$-finite. There exists an $\mathcal{F}$-measurable
$\frac{d\mu}{d\nu} := g : \Omega \to \mathbb{R}$ such that

$$\mu(A) = \int_A g\, d\nu$$

for $A \in \mathcal{F}$.

Let $(\Omega, \mathcal{F}, \mu)$ be a probability space and $\nu$ a measure such
that $\nu(\Omega) \le 1$ and $\mu \ll \nu$.

Definition 1 (Kullback-Leibler information [5]): Suppose
$\mu \ll \nu$.

$$D(\mu \| \nu) := \int d\mu \log \frac{d\mu}{d\nu}.$$

Any $\mathcal{F}$-measurable map $X : \Omega \to \mathbb{R}$ is said to be a random variable.